{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a5e778b7",
   "metadata": {},
   "source": [
    "# 第三章 嵌入 Embeddings\n",
    "\n",
    "\n",
    "在本章中，我们将学习 Embeddings（嵌入）。Embeddings 可以理解为一种计算机更容易处理的文本的向量表示，通常用于将离散的、高维的数据表示（如单词、句子或文档）转换为连续的、低维的向量表示。\n",
    "\n",
    "Embeddings 的目的是捕捉数据之间的语义和语法关系，使得相似的数据在嵌入空间中更接近。例如，在自然语言处理中，可以使用 Word Embeddings （词嵌入）将单词映射到连续的向量空间，使得具有相似含义的单词在嵌入空间中距离更近。这种特性也使得它们成为 LLM（大语言模型）中最重要的组成部分之一。\n",
    "\n",
    "在本章课程中，我们需要用到 Cohere 的 API key。\n",
    "\n",
    "## 目录\n",
    "\n",
    "- [一、环境配置](#一、)\n",
    "\n",
    "- [二、词嵌入（Word Embeddings）](#二、)\n",
    "\n",
    "  - [2.1 理解嵌入的概念](#2.1)\n",
    "  \n",
    "  - [2.2 实现词嵌入](#2.2)\n",
    "  \n",
    "- [三、句嵌入（Sentence Embeddings）](#三、)\n",
    "\n",
    "  - [3.1 理解句嵌入的概念](#3.1)\n",
    "  \n",
    "  - [3.2 实现句嵌入](#3.2)\n",
    "  \n",
    "- [四、文档嵌入（Articles Embeddings）](#四、)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e30b4c0",
   "metadata": {},
   "source": [
    "## 一、环境配置  <a id=\"一、\"></a>\n",
    "\n",
    "让我们先准备好需要用到的一些 Python 库和 API。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4568b97a",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install cohere umap-learn altair datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68949b1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 下面的代码可以帮助我们加载需要用到的 API\n",
    "import os\n",
    "from dotenv import load_dotenv, find_dotenv\n",
    "_ = load_dotenv(find_dotenv()) # 读取本地 .env 文件"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "656bf0c1",
   "metadata": {},
   "source": [
    "我们需要导入 Cohere 库，并且使用 API 密钥创建一个 Cohere 客户端。\n",
    "\n",
    "Cohere 库是一个包含调用大语言模型的函数的库，可以通过 API 调用来调用这些函数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a4844a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import cohere\n",
    "co = cohere.Client(os.environ['COHERE_API_KEY'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "135c18f5",
   "metadata": {},
   "source": [
    "此外，我们还需要导入 Pandas 库，它可以用于数据分析与数据处理。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2a0e8869",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50da3e10",
   "metadata": {},
   "source": [
    "## 二、词嵌入（Word Embeddings）<a id=\"二、\"></a>\n",
    "\n",
    "### 2.1 理解嵌入的概念  <a id=\"2.1\"></a>\n",
    "\n",
    "让我们先学习什么是 Embeddings。\n",
    "\n",
    "在这里，我们有一个带有横轴、纵轴以及坐标值的网格，我们可以看到一堆单词位于这个网格中。\n",
    "如果要把单词放到合适的位置，你会把单词\"apple\"（译为“苹果”）放在哪里？\n",
    "\n",
    "![Alt text](images/3-1.png)\n",
    "\n",
    "正如你在这个网格中所看到的，相似的单词被分组在一起。\n",
    "所以在左上方有足球、篮球、乒乓球，左下方有房屋、建筑和城堡，右下方有自行车和汽车等交通工具，右上方有水果。\n",
    "因此，apple 将被归类为右上方的水果。\n",
    "然后，我们将表中的每个单词与坐标轴关联起来。这里 apple 的坐标是（5，5）。\n",
    "\n",
    "这就是一种 Embeddings ，它将每个单词映射为两个数值组成的向量。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "646832bd",
   "metadata": {},
   "source": [
    "一般来说，Embeddings 会将单词映射到更多的数值。我们会有尽可能多的单词，为了将每个单词都能表示，实际使用的 Embeddings 可以将一个单词映射到数百个数值，甚至数千个。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3859edc",
   "metadata": {},
   "source": [
    "### 2.2 实现词嵌入  <a id=\"2.2\"></a>\n",
    "\n",
    "我们将使用一个非常小的数据表。它包含三个单词：joy（欢乐）、happiness（快乐） 和 potato（马铃薯） ，我们用 Pandas 将其创建，如下所示："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "be6ddfdc",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>joy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>happiness</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>potato</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        text\n",
       "0        joy\n",
       "1  happiness\n",
       "2     potato"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "three_words = pd.DataFrame({'text':\n",
    "  [\n",
    "      'joy',\n",
    "      'happiness',\n",
    "      'potato'\n",
    "  ]})\n",
    "\n",
    "three_words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5cf3135d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>欢乐</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>快乐</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>马铃薯</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  text\n",
       "0   欢乐\n",
       "1   快乐\n",
       "2  马铃薯"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 中文版本\n",
    "three_words = pd.DataFrame({'text':\n",
    "  [\n",
    "      '欢乐',\n",
    "      '快乐',\n",
    "      '马铃薯'\n",
    "  ]})\n",
    "\n",
    "three_words"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "063208e1",
   "metadata": {},
   "source": [
    "接下来，我们为这三个单词创建 Embeddings。\n",
    "我们使用 Cohere 库中的 embed 函数来创建这些 Embeddings。\n",
    "embed 函数接受一些输入。第一个输入是我们要嵌入的数据集\"three_words\"，我们还需要指定所使用的列为\"text\"，以及要使用的模型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd461b39",
   "metadata": {},
   "outputs": [],
   "source": [
    "three_words_emb = co.embed(texts=list(three_words['text']),\n",
    "                           model='embed-english-v2.0').embeddings  # 英文版本用英文嵌入模型 embed-english-v2.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d3ada8f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 中文版本\n",
    "three_words_emb = co.embed(texts=list(three_words['text']),\n",
    "                           model='embed-multilingual-v2.0').embeddings  # 中文版本用多语言嵌入模型 embed-multilingual-v2.0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b43c14fc",
   "metadata": {},
   "source": [
    "现在让我们来看一下与每个单词相关联的 Embeddings。我们将与单词\"joy\"相关联的 Embeddings 称为\"word_1\"，可以通过查看\"three_words_emb\"的第一行来获取。对\"word_2\"和\"word_3\"我们也做同样的操作。它们是与单词\"happiness\"和\"potato\"对应的 Embeddings 。\n",
    "\n",
    "我们可以打印单词\"joy\"相关联的 Embeddings 的前10个数值看看，即\"word_1\"中的前十个数值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "074dc2b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "word_1 = three_words_emb[0]\n",
    "word_2 = three_words_emb[1]\n",
    "word_3 = three_words_emb[2]\n",
    "\n",
    "print(word_1[:10])\n",
    "# 注：下面输出的结果是英文版本的结果，如果是中文版本，会有所不同。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57cf4586",
   "metadata": {},
   "source": [
    "[2.3203125,\n",
    " -0.18334961,\n",
    " -0.578125,\n",
    " -0.7314453,\n",
    " -2.2050781,\n",
    " -2.59375,\n",
    " 0.35205078,\n",
    " -1.6220703,\n",
    " 0.27954102,\n",
    " 0.3083496]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45febbc8",
   "metadata": {},
   "source": [
    "## 三、句嵌入（Sentence Embeddings）<a id=\"三、\"></a>\n",
    "\n",
    "### 3.1 理解句嵌入的概念  <a id=\"3.1\"></a>\n",
    "\n",
    "Embeddings 不仅可以用于单词，还可以用于更长的文本片段。实际上，它可以是非常长的文本片段。\n",
    "\n",
    "![Alt text](images/3-2.png)\n",
    "\n",
    "在这个例子中，我们有一些句子的 Embeddings。\n",
    "现在这些句子被转化为一个向量或数值列表。\n",
    "请注意，第一个句子是\"hello, how are you?\"，最后一个句子是\"Hi, how's it going?\"。\n",
    "它们没有相同的单词，但它们的句意非常相似，所以 Embeddings 会将它们映射到一些非常接近的数值。\n",
    "\n",
    "### 3.2 实现句嵌入  <a id=\"3.2\"></a>\n",
    "\n",
    "准备一个包含多个句子的小型数据集。如你所见，这个数据集有八个句子，它们是成对出现的。每个句子都是前一个句子的答案，例如，\"What color is the sky?\"的答案是\"The sky is blue\"，\"What is an apple?\"的答案是\"An apple is a fruit\"。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98eddd56",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 创建一个包含八个句子的 DataFrame\n",
    "sentences = pd.DataFrame({'text':\n",
    "  [\n",
    "   'Where is the world cup?',  # 句子1: 世界杯在哪里？\n",
    "   'The world cup is in Qatar',  # 句子2: 世界杯在卡塔尔。\n",
    "   'What color is the sky?',  # 句子3: 天空是什么颜色的？\n",
    "   'The sky is blue',  # 句子4: 天空是蓝色的。\n",
    "   'Where does the bear live?',  # 句子5: 熊住在哪里？\n",
    "   'The bear lives in the woods',  # 句子6: 熊住在森林里。\n",
    "   'What is an apple?',  # 句子7: 苹果是什么？\n",
    "   'An apple is a fruit',  # 句子8: 苹果是一种水果。\n",
    "  ]})\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26e5bb42",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 中文版本\n",
    "sentences = pd.DataFrame({'text':\n",
    "  [\n",
    "   '世界杯在哪里？',\n",
    "   '世界杯在卡塔尔', \n",
    "   '天空是什么颜色的?', \n",
    "   '天空是蓝色的', \n",
    "   '熊住在哪里？',  \n",
    "   '熊住在森林里', \n",
    "   '苹果是什么?',  \n",
    "   '苹果是一种水果',  \n",
    "  ]})\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e094fe5f",
   "metadata": {},
   "source": [
    "现在，我们仍然使用 Cohere 库中的 embed 函数，将所有这些句子转化为 Embeddings ，并观察哪些句子彼此接近或远离。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91ece326",
   "metadata": {},
   "outputs": [],
   "source": [
    "emb = co.embed(texts=list(sentences['text']),\n",
    "               model='embed-english-v2.0').embeddings  # 英文版本用英文嵌入模型 embed-english-v2.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b26903d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 中文版本\n",
    "emb = co.embed(texts=list(sentences['text']),\n",
    "               model='embed-multilingual-v2.0').embeddings  # 中文版本用多语言嵌入模型 embed-multilingual-v2.0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1f712d9",
   "metadata": {},
   "source": [
    "让我们来看一下每个句子的嵌入的前 3 个数值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "befe3331",
   "metadata": {},
   "outputs": [],
   "source": [
    "for e in emb:\n",
    "    print(e[:3])\n",
    "# 注：下面输出的结果是英文版本的结果，如果是中文版本，会有所不同。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7abc3d2",
   "metadata": {},
   "source": [
    "[0.27319336, -0.37768555, -1.0273438]\n",
    "\n",
    "[0.49804688, 1.2236328, 0.4074707]\n",
    "\n",
    "[-0.23571777, -0.9375, 0.9614258]\n",
    "\n",
    "[0.08300781, -0.32080078, 0.9272461]\n",
    "\n",
    "[0.49780273, -0.35058594, -1.6171875]\n",
    "\n",
    "[1.2294922, -1.3779297, -1.8378906]\n",
    "\n",
    "[0.15686035, -0.92041016, 1.5996094]\n",
    "\n",
    "[1.0761719, -0.7211914, 0.9296875]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45033130",
   "metadata": {},
   "source": [
    "再看看每个句子的 Embeddings 有多少个数值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c82dd41",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(len(emb[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ddb1ec2",
   "metadata": {},
   "source": [
    "在这个特定的例子中，答案是 4096 个，但不同的 Embeddings 长度也会不同。\n",
    "\n",
    "让我们来可视化一下这个数据集的 Embeddings。\n",
    "我们调用 utils 库里的名为 umap_plot 函数，它会调用 umap 和 altair 包，并生成下面的图。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df73a83e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils import umap_plot\n",
    "# 使用 umap_plot 函数生成图表\n",
    "chart = umap_plot(sentences, emb)\n",
    "# 调用 interactive 方法以显示交互式图表\n",
    "chart.interactive()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dde2cdd3",
   "metadata": {},
   "source": [
    "![Alt text](images/3-3.png)\n",
    "\n",
    "这个图给出了八个点，每个点代表我们的数据集中的一个句子，将鼠标放到点上，会显示出这个点代表哪个句子。\n",
    "\n",
    "我们观察到，句意相似的两个句子之间挨得非常近，比如'Where does the bear live?'和'The bear lives in the the woods'。\n",
    "\n",
    "所以我们可以得到一个结论，Embeddings 会将句意相似的点放在靠近的位置上，句意相差大的点放在相距较远的位置上。\n",
    "通常情况下，与一个问题句意最相似的就是它特定的答案。\n",
    "因此，我们可以通过搜索与问题位置最接近的句子来找到问题的答案。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "218c8b01",
   "metadata": {},
   "source": [
    "## 四、文档嵌入（Articles Embeddings）<a id=\"四、\"></a>\n",
    "\n",
    "现在你已经知道如何对包含八个句子的小数据集进行 Embeddings 了，接下来让我们来处理一个大数据集。\n",
    "\n",
    "我们将使用一个包含维基百科文章的大数据集。\n",
    "它有 2000 篇带有标题的文章、第一段文字的文本以及第一段文字的 Embeddings。\n",
    "让我们用下面的代码加载以下数据集。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90769a42",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "wiki_articles = pd.read_pickle('wikipedia.pkl')\n",
    "wiki_articles[['title','text','emb']]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e29abd4a",
   "metadata": {},
   "source": [
    "![Alt text](images/3-4.png)\n",
    "\n",
    "我们将导入 Numpy 库和一个帮助我们可视化这个图的函数，这个图与之前的图非常相似。\n",
    "我们将其降维到二维，以便我们观察。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e80cde2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from utils import umap_plot_big\n",
    "\n",
    "# 从 wiki_articles 中获取 'title' 和 'text' 列的数据\n",
    "articles = wiki_articles[['title', 'text']]\n",
    "\n",
    "# 从 wiki_articles 中获取 'emb' 列的数据，并转换为 numpy 数组\n",
    "embeds = np.array([d for d in wiki_articles['emb']])\n",
    "\n",
    "# 使用 umap_plot_big 函数生成图表\n",
    "chart = umap_plot_big(articles, embeds)\n",
    "\n",
    "# 调用 interactive 方法以显示交互式图表\n",
    "chart.interactive()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef6e3514",
   "metadata": {},
   "source": [
    "将鼠标放到图中的点上，可以显示文章的内容。可以观察到到相似的文章位于相似的位置。\n",
    "\n",
    "![Alt text](images/3-5.png)\n",
    "\n",
    "这就是关于 Embeddings 的内容。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
